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Abstract 

In order to obtain more accurate solutions of polynomial systems with numerical continu- 
ation methods we use multiprecision arithmetic. Our goal is to offset the overhead of double 
double arithmetic accelerating the path trackers and in particular Newton's method with a 
general purpose graphics processing unit. In this paper we describe algorithms for the mas- 
sively parallel evaluation and differentiation of sparse polynomials in several variables. We 
report on our implementation of the algorithmic differentiation of products of variables on the 
NVIDIA Tesla C2050 Computing Processor using the NVIDIA CUDA compiler tools. 

Key words and phrases, algorithmic differentiation, compute unified device architecture 
(CUDA), graphics processing unit (GPU), massively parallel polynomial evaluation, Speelpen- 
ning product. 

1 Introduction 

The problem we consider in this paper is the efficient evaluation of a polynomial system and 
its Jacobian matrix on a graphics processing unit (GPU), the NVIDIA Tesla C2050 Computing 
Processor. For an introduction to GPU computing, we refer to [17] and [29]. The success of 
general purpose GPU computing in many areas of scientific computing is explained in |16j . 

The evaluation of a polynomial system and its Jacobian matrix is a computationally intensive 
stage in Newton's method to approximate an isolated solution. Numerical continuation methods 
apply Newton's method as corrector in predictor-corrector algorithms to track paths of solutions 
defined by a homotopy. Homotopy continuation methods have led to efficient numerical solvers 
of polynomial systems (see e.g. [2], [23], [28], [22]) and constitute the computational engine in the 
emerging area of numerical algebraic geometry ([IS], [34], [35]). 
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Granularity issues and parallel complexity of continuation methods for nonlinear systems are 
discussed in [3] and [6]. If one is interested in computing all isolated solutions of a polyno- 
mial system, then distributing path tracking jobs in a manager /worker paradigm using message 
passing |33j leads to very good speedups. Such parallel implementations are in Bertini [3], 
H0M4PS-2 . Opara [23], PHoMpara [T3], POLSYS_GLP [SB], and PHCpack [37j, documented in [33], 
[IB EE], [22], [38], and [UJ. 

For large polynomial systems in many variables and of high degrees we have observed that 

(1) the cost of polynomial evaluation often dominates the cost of linear algebra operations; and 

(2) the double precision in standard hardware is often insufficient to guarantee accurate results. 
When running many path tracking jobs, a couple or perhaps just one solution path may require 
extended multiprecision arithmetic. Then we want to apply parallel algorithms to offset the extra 
cost of software driven arithmetic. In analogy to speedup, we use the notion of quality up (inspired 
by [I]) and ask the question: given p processors (or cores) how much extra precision can we afford 
in roughly the same time as a sequential run? 

Because of simple memory management on shared memory multicore processing we selected 
the quad double library QD 2.3.9 |15] . The ideas to achieve extended precision using hardware 
doubles originate in [7J, see also [30], [31] and [32]. In [30], we determined experimentally that 
the cost factor in the overhead of using double double arithmetic is around 8, coinciding with 
the number of cores on the workstation we were using at that time. Then the cost of tracking 
one solution path in double double arithmetic can be compensated in a parallel multicore im- 
plementation, thus achieving quality up. Using techniques of algorithmic differentiation [12], we 
extended this work in |39j . 

This current paper describes our efforts to offset the cost of extended precision by a parallel 
implementation of evaluation and differentiation algorithms on a GPU. Recently, quad double 
precision arithmetic has been made available on GPU, see [25 . Our long term goal is use GPUs 
to accelerate the solver of PHCpack [37] . 

Related work in algebraic computations on a GPU are polynomial multiplication [H], [27] , 
resultant [9], GCD computations [10], and solving bivariate polynomial systems [26] . Parallel 
automatic differentiation techniques are described in [5] and |11| . 

2 Problem Statement 

The problem we consider is the evaluation of system of polynomial equations in several variables 
and all its derivatives (as needed in the Jacobian matrix of the system). Let n denote the number 
of variables. A polynomial / in 7i variables x — (^i^ %2i ■ ■ ■ ? 

) is stored clS cl tuple (C,A) of 
complex coefficients C and corresponding exponents A. In multi-index notation we write / as 

/(x) = ]Tc a x a , ceC\{0}, x a = xr4 2 (1) 

Then a system f(x) = is defined by a tuple of coefficients and supports. 

In our problem setup, we consider as inputs sparse polynomials, that is: only relatively few 
monomials appear with nonzero coefficients, few relative to the degree of the polynomials. For 
dense polynomials, a nested Horner scheme is recommended, see [18]. Because of the exponential 
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growth of the number of monomials, dense polynomials in several variables of high degree do not 
occur often in applications. For establishing benchmarks we consider in this paper systems with 
a fixed number k of variables in monomials, a fixed maximal degree d up to which any of variables 
can appear in monomials of the system, and a fixed number m of monomials in all polynomials. 

There are three stages in the evaluation of a polynomial system and its Jacobian matrix: 

1. The computation of the monomial products x^ 1 ^"j 2 1 • • • x^ fe 1 where the exponents 
o-ij > 1, j = 1,2, ...,k, and 1 < i\ < i% < . . . < iu < n. A preprocessing step is the 
computation of all powers of x?, for all i € {1, 2, . . . , n} and j G {2, . . . , d — 1}. 

2. The evaluation of products of variables x^x^ ■ ■ ■ Xi k , and all their derivatives. The product 
of variables x^x^ ■ ■ ■ Xi k is known in [12] as the example of Speelpenning. In this stage 
we also multiply Speelpenning products and their derivatives with the values of mono- 
mial products computed in the previous stage to obtain the values of monomials and their 
derivatives. 

3. The multiplication of the coefficients with the corresponding evaluated monomials, followed 
by summations of obtained products within hosting polynomials of the system and the 
Jacobian. 

While the evaluation of high dimensional and high degree polynomials is computationally 
intensive, the challenge for data parallelism lies in the irregularity of the data. In order to achieve 
good speedups, we will derive regularity assumptions on the input data. 

3 Massively Parallel Algorithms 

Following the three stages (outlined in the problem statement section) in the evaluation of a 
polynomial system and its Jacobian matrix, we devote one subsection to each stage. There 
are three kernels. The first kernel corresponds to the first stage, whereas the second kernel is 
described in subsections 2 and 3 below. We explain the third kernel in the second half of the 
third subsection. 

3.1 Common Factor Calculation 

To evaluate a monomial x\x\x\ and its derivatives, we first evaluate the factor x\x^xz and then 
multiply this factor with all derivatives of X1X2X3. Because xfx^x^ is common to the evaluated 
monomial and all its derivatives, we call x\x^x^ a common factor. This section is concerned with 
the evaluation of all common factors. 

The kernel to compute common factors operates in two stages: 

1. each of the first n threads of a thread block computes sequentially powers from the 2nd to 
the (d — l)th of one of the n variables; 

2. each of the threads of a block computes a common factor for one of the monomials of the 
system, just as a product of k quantities computed at the first stage of the kernel. 
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Storing the values of the successive variables in the successive locations of the global memory 
enables a coalesced reading by the threads of a warp of the values of the variables of the system 
from the global memory into the shared memory, as an input for the first stage of the kernel. 

Both stages of the kernel are largely SIMT (Single Instruction Multiple Thread) routines since 
at the first stage each of the non idle threads just performs d— 1 multiplications, and at the second 
stage each thread in each warp just performs k — 1 multiplications. 

The precomputed at the first stage powers of variables, are stored at the shared memory of 
the blocks, since these powers essentially constitute shared input data for the threads of the block 
while the threads are working on the second stage of the kernel. The powers are stored in shared 
memory in a two dimensional array Powers of complex numbers, an (i,j)th element of which 
represents the ith power of the jth variable. Such indexing is supposed to minimize a number of 
shared memory bank conflicts at least during the first stage of the kernel, as different threads in 
a warp, after computing the current power of associated to them variables, will be writing the 
power values into different banks of the shared memory. 

As the threads of a block perform the second stage of the kernel, each of them computes a 
product of k quantities, computed at the first stage. As a thread proceeds to each next element 
in a product, to know what element to access in the shared memory array POWERS, it would 
need to know which variable and what exponent appears next in the assigned to it monomial. 
The information about positions of variables and their exponents does not change along the path 
tracking, and thus might be stored in constant memory of the card. We reserve two arrays of 
unsigned chars POSITIONS and EXPONENTS in the constant memory to represent this information. 
Each element in POSITIONS represents a position of a variable from to 255 in one of monomials 
of the system, and the element with the same index in Exponents represents the degree of this 
variable decreased by one in the same monomial, giving us opportunity to work with variables 
appearing in degrees up to 255. 

We need at least about 1,000 monomials to occupy well all the 14 multiprocessors of our 
card for the algorithms we consider here, so several warps would work on each multiprocessor 
simultaneously to hide long latency operations. This and the capacity of the constant memory, 
65,536 bytes, prescribes working dimensions for our polynomial evaluation: those are ranging 
from 30 to 40. If we want to keep m, the number of monomials in the polynomial, to be equal 
roughly to the dimension of the system, and k, the number of variables in the monomial, about 
half of the dimension. Indeed: for dimension 30 we would have 900 monomials, with a need of 
900 x 2 x 15 < 30,000 bytes; for dimension 40 we would have 1,600 monomials, with a need of 
1, 600 x 2 x 20 = 64, 000 bytes. 

We are planning to introduce more compact encodings for storing the positions and exponents 
of the variables in the constant memory so to be working with higher dimensions. The more 
compact encodings might introduce some branching for the threads of a warp, after the decoded 
indexing information would be read from the constant memory into the registers of the block, 
while each thread in a warp would be encoding the actual position and exponent of the next 
variables power, which it needs to use for its computations. However the computations, which 
would follow encodings (the multiplications), where the threads of a warp will join again one 
path of execution, are supposed to dominate encodings in time, especially if higher precision 
multiplications would be used. Thus with new ways of decoding, incorporated to store more 
efficiently monomial information in the constant memory, and employed multiprecision, we hope 
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increase working dimensions for our implementation. 

After each thread of a block computes its common factor, the successive threads of the block 
conveniently write their output values (one value per thread) into successive locations of the 
global memory, thus providing a coalesced output for the kernel. 

As an alternative to computing common factors in the two above stages, one can skip precom- 
puting powers, and assign to each thread all work, which is necessary for computing of assigned to 
it common factor, to do by itself from scratch. This could be done entirely in registers assigned to 
a block, without any use of shared memory. However this would introduce branching in execution 
of threads of a warp when monomials would have different tuples of exponents, and if one would 
choose that each thread would compute all powers up to d — 1 for participating in its monomial 
variables, it would most likely cause multiple exponentiation of the same variables by threads 
within warps since the same variables tend to appear in multiple monomials of a system. In our 
algorithm powers of variables are also computed multiple times - each block of threads computes 
its own copy of the set of powers from 2nd to d — 1 for all n variables of the system. 

This might look as a drawback of the algorithm. This is not really so. For our working 
dimensions ranging from 30 to 40, and the number of maximal cores for one multiprocessor 32, 
we would need to assign at most two blocks to work on precomputing degrees if we want to do 
it only once, in this case 12 of 14 multiprocessors would be idle during precomputing powers of 
variables. Also to start using the other 12 multiprocessors for the second stage of computing 
common factors, we would need to write the precomputed powers into the global memory, then 
to invoke a separate kernel with enough blocks to occupy all multiprocessors, and then threads 
of each block of the new kernel will access the global memory again for reading the powers 
of variables stored there. Our algorithm, as an alternative to prompted by the just described 
two kernels scheme additional time cost for global memory reading and writing, introduces the 
additional time cost, which is illustrated well by the following example. 

Consider a system of dimension 32 with 28 monomials in each polynomial. If we will work 
with blocks of 32 threads, 28 blocks of threads will be launched. Then, in the worst case, if only 
one block will be occupying one multiprocessor at a time, the execution time for our two-stages 
kernel will be the same as if one block of 32 threads would be launched two times in a row. Thus 
precomputing powers, despite in fact it would be done 28 times, time-wise would take the same 
amount of time as it would be done twice. Then, as within one thread block powers of all variables 
are computed in parallel, for our example then precomputing degrees would take in the worst 
case the same time as is needed for one core to compute 2{d — 2) multiplications (variables for the 
common factors need to be raised up to the power d — 1, which requires d — 2 multiplications). 
The degree d is in most cases not that high (while still allowing high total degrees for monomials) , 
thus multiple precomputing degrees in our two stages one kernel algorithm in most cases would 
compensate for the additional necessary global memory accesses as the powers are precomputed 
only once, and most likely, even reduce the computational time for precomputing powers. 

3.2 Monomial Evaluation and Differentiation of Products of Variables 

In this section we describe the implementation of the algorithm to evaluate a product of variables 
Xi ± Xi 2 ■ ■ ■ Xi k and all its derivatives. We call this product of variables a Speelpenning product. 

In our second kernel each thread first computes one monomial and its partial derivatives. 
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Secondly it multiplies the computed value of the monomial by its coefficient in the hosting that 
monomial polynomial of the system, as well as it multiplies the values of the computed derivatives 
of the monomial by their coefficients in the hosting those monomial derivatives polynomials of 
the Jacobian. Thus this kernel completes computing additive terms of the polynomials of the 
system and the Jacobian, and the third last kernel only adds appearing in each polynomial terms 
to finish polynomial evaluation. 

A thread of the second kernel performs only 5k — 4 multiplications and uses k + 1 complex dou- 
ble locations of shared memory L±, L2, ■ ■ ■ , £fc+i and one variable in registers to perform all the 
announced above work. As was discussed in the previous section, through an example, we obtain 
the partial derivatives of a monomial x a by multiplying the common factor x^ 1 x^ 2 ■ ■ ■ x"^ 
by the partial derivatives of x^x^ ■ ■ ■ Xi k . It takes 3k — 6 multiplications out of 5k — 4 multipli- 
cations performed by a thread to compute partial derivatives of Speelpenning product. Another 
k multiplications are performed to multiply the common factor by the values of derivatives of 
Speelpenning product to obtain monomial derivatives. One additional multiplication is done to 
obtain the value of the monomial itself as a product of a derivative of the monomial with respect 
to any of participating in it variables and the value of that variable. Finally a thread performs 
another k + 1 multiplications to multiply the values of the monomial and its derivatives by the 
coefficients. 

To obtain derivatives of Speelpenning product a thread first stores x^ in the location L2. 
Then it computes sequentially, by k — 2 multiplications, the k — 2 forward products x^x^, 
Xi 1 Xi 2 Xi 3 , . . . , Xi 1 Xi 2 Xi 3 ■ ■ ■ Xi k _ 1 , for each new r ranging from 1 to k — 2 obtaining the product 
Xi 1 Xi 2 Xi 3 ■ ■ ■ Xi r+1 as (xi 1 Xi 2 Xi 3 • • • Xi r )xi r+1 and storing the newly obtained forward product into 
location L r+ 2- Eventually the locations £3,...,!^ are filled with the k — 2 obtained forward 
products. Note that at this point the location contains the derivative of the Speelpenning 
product with respect to Xi k . In registers of the block we keep the only complex double variable 
Q to store the current backward product. We initialize Q with X{ k . A thread computes the 
derivative of the Speelpenning product with respect to Xi k _ 1 at L^-i by multiplying stored in 
that location the forward product Xi 1 Xi 2 Xi 3 • • • Xi k _ 2 by the current value of Q, which is Xi k . 

In the next k — 3 steps, each of which consists of two multiplications, we compute partial 
derivatives of the Speelpenning product with respect to Xi 2 , x« 3 , . . . , Xi k _ 2 , and store the computed 
values in locations L2, £3, . . . , Lu-2- At the rth step, as r ranges from 1 to k — 3, the Q represents 
the backward product Xi k Xi k _ 1 • • • xi k _ r . At the rth step we first update the value of Q, accordingly 
to its above definition, by one multiplication as Q = Q x Xi k _ r . The second multiplication updates 
the shared memory location Lk- r -i as Lk- r -i = £fc-r-i x Q, so to obtain in this location the 
partial derivative of Speelpenning product with respect to product of previously stored 

there forward product Xi 1 Xi 2 Xi 3 • • • X{ k _ r _ 2 times the current backward product Xi k Xi k _ 1 ■ ■ ■ X{ k _ r . 

Finally we obtain the last yet not obtained partial derivative of Speelpenning product with 
respect to x; L1 at Q, by the product Q = Q x x, 2 and store the obtained value at the shared 
memory location L\. 

The above procedure prescribes to a thread to perform k — 2 multiplications to obtain the 
forward products, k — 2 multiplications to obtain the backward products, and k — 2 multiplications 
of backward and forward products. Thus indeed, the total number of multiplications for obtaining 
all derivatives of the Speelpenning product equals 3k — 6. 
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Now a thread computes monomial derivatives in locations L±, L2, ■ ■ ■ , by multiplying stored 
in these locations values of derivatives of Speelpenning product by the common factor computed in 
the first kernel. Then it computes the value of the monomial itself as the product of its derivative 
with respect to Xi k , stored in times the value of Xi k . It stores the computed monomial value 
at Finally it multiplies each of the values stored in L 1; L 2 , . . . , Lfc+i, i.e.: the values of the 

monomial and its derivatives, by the corresponding coefficients. 

As k - the number of variables in a monomial - is the same for all monomials of the system, each 
thread of the second kernel will go through the same path of execution for the entire list of 
instructions of the kernel, which largely amounts to 5k — 4 complex double multiplications. Thus 
all 32 threads within each warp will be indeed doing all the prescribed work for the assigned to 
them 32 monomials in a parallel fashion on an available multiprocessor during the execution. 

We close this subsection with some memory considerations. Denote by B the block size, the 
number of threads in a block. 

In addition to the fast access space of B(k + 1) locations equally divided between threads of 
a block for storing their intermediate results, as the threads proceed along the kernel, we reserve 
in shared memory of a block a space for values of all variables of the system. The values of 
the variables are subject shared use of the threads of each block, as the same variables appear 
in different monomials. Thus, provided values of successive variables are stored in successive 
locations of global memory, and working with n = 32, k = 16, B = 32, we would need to access 
global memory only once by all threads of a block simultaneously, as each thread would request 
a value of one variable, to download the values of all 32 variables into the shared memory of the 
block for their further common use by all threads of the block. 

At the same time, if shared memory would not be used for storing values of variables, each 
thread would need to access global memory at least 16 times to get the values of all appearing 
in its monomial variables. The shared memory capacity allows us to apply the above algorithm 
of the second kernel for our working dimensions ranging between 30 and 40 as well as for some 
larger dimensions. We also could increase precision from double to double double and still work 
with dimensions up to 70, as long as k is less or equal than a half of dimension. Indeed, each 
thread would need for treating its monomial k + 1 complex double double locations, thus 

(n/2 + 1) x 2 x sizeof (double double) 

< (70/2 + 1) x 2 x 16 = 1152 

bytes in shared memory. To treat 32 monomials by a block of 32 threads we would need then at 
most 32 x 1152 = 36864 bytes of shared memory. Adding to this 

n x sizeof (complex double double) 
< 70 x 2 x sizeof (double double) 
= 70 x 2 x 16 = 2240 

bytes in shared memory for storing values of the variables, we are still (49, 152 — (36, 864 + 
2, 240)) > 10, 000 bytes below the maximal capacity of the shared memory of a block. 

Another important note about the memory management is that the array POSITIONS in 
constant memory, which contains positions indexes of variables in the monomials, and used in 
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the first kernel, is used in this kernel as well, as threads are determining what variable in the 
shared memory to access as they need to perform each new multiplication while updating their 
forward and backward products. 

3.3 Multiplication of Evaluated Monomials with Coefficients and Summation 
of Terms 

In the third stage the evaluated monomials first are multiplied with their coefficients in the 
polynomials of the system or the Jacobian. 

The coefficients are stored in the global memory, since the capacity of the constant memory is 
exhausted by the variables positions indexes and variables exponents information. As we multiply 
monomials and their derivatives by the coefficients, we need to read the values of coefficients from 
the global memory fast. The total number of monomials in the system is n x m. For mapping 
purposes all the monomials are ordered in a sequence S m of length n x m. For instance the 
monomials in S m might be ordered as following: first m elements of the sequence are the mono- 
mials of the first polynomial, the next m elements are the monomials of the second polynomial, 
and so on. The coefficients are stored during entire path tracking in an array COEFFS of length 
n x m x (k + 1), which is the total number of monomials in the system and its Jacobian. The 
coefficients in Coeffs are stored in the following order: 

• The first element of Coeffs is the coefficient of the derivative of the first monomial in S m 
with respect to its first variable; 

• the second element of Coeffs is the coefficient of the derivative of the second monomial in 
S m with respect to its first variable, and so on until 

• the nmth element of Coeffs, which is the coefficient of the derivative of the last monomial 
in S m with respect to its first variable. 

• The next n x m elements of Coeffs are the coefficients of the derivatives of monomials 
from S m , with respect to the monomials second variables, also listed in accordance with 
order in S m . 

The portions of nm coefficients come in a similar manner until the fcth portion of nm coefficients, 
in which are stored, in order inherited from S m , the coefficients of monomial derivatives with 
respect to the monomials A;th (last) variables. The last {k + l)th portion of the nm coefficients 
contains actually the coefficients of the system in order prescribed by order in S m . With this 
way of storing coefficients, if ith thread of the second kernel is in charge of ith. monomial in S m 
for each i = 1,2,..., nm, we largely obtain a coalesced access within warps, as threads of a warp 
prescribed simultaneously to access the coefficients of their monomials or the jth, j G {1, 2, . . . , k}, 
derivative's coefficients of their monomials in Coeffs. 

After multiplying the monomial and their derivatives values by coefficients, which is the last 
computational step of the second kernel, it is just left to add the corresponding computed additive 
terms to obtain the values of the polynomials of the system and of the Jacobian. If the size of 
a thread block used for the execution of the second kernel is smaller than m, then monomials 
of each polynomial of the system are treated by multiple blocks of the second kernel. In this 
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case, even if some of the involved summations are done yet by the threads of the second kernel, 
it is necessary to launch another kernel to combine partial sums which are obtained by different 
blocks of the second kernel, which are working on monomials of the same polynomials. The 
situation, when the size of a thread block of the second kernel is less than m, is very common for 
our working dimensions: we try to keep the block size of the second kernel equal to 32, because 
of described above shared memory limited capacity considerations, on the other hand, we are 
willing to work with higher dimensions, ranging from 50 to 70, while we want to keep m ~ n. 
Also, computing partial sums for polynomials of the Jacobian by threads of the second kernel 
would involve branching in execution paths of the threads within warps, as different subsets of 
variables appear in monomials treated by different threads within a warp. Because of the above 
reasons we decided to introduce a third kernel, which would perform all involved summations, so 
to complete obtaining the values of the polynomials, as all multiplicative operations are done by 
the first two kernels. 

Each thread of the third kernel sums additive terms of one of n 2 + n polynomials of the 
combined set of polynomials of the system and the Jacobian matrix. To make each thread to go 
through the same execution path, all what we assign to each thread to do during the execution 
of the kernel is to add exactly m terms. Thus, if a thread computes the value of the derivative 
of the pth polynomial with respect to Xi, and a jth monomial in the pth polynomial does not 
contain Xi, the thread which computes the derivative of the pth polynomial with respect to Xi, 
at the jth step does add to its current partial sum zero - the zero monomial derivative, which 
we probably never would add in a CPU execution. To ensure this, without introducing any if 
statements, the output array of the second kernel in the global memory along with its meaningful 
nm(k + 1) locations (the number of monomials and monomial derivatives of the system) contains 
also (n 2 + n)m — nm(k + 1) locations, the values at which are originally set and kept to store 
zero values along the entire path tracking. These zero locations represent the zero monomial 
derivatives as in the described above situation. We also wish that the threads within warps of 
the third kernel for each step j, j = 1, 2 ... m would perform a coalesced reading of the input 
data entries. To allow coalesced reading of the values of monomials and their derivatives by the 
threads of the third kernel, and to introduce the (n 2 + n)m — nm(fe + l) zero monomial derivatives, 
the output of the second kernel is stored in the global memory in array Mons in the format we 
explain next. 

The size of the array Mons is (n 2 +n)m, representing the terms in n 2 +n summations, m terms 
each. The first n 2 + n elements of the array represent the first terms in each of n 2 + n summations 
(polynomials). In particular, these first n 2 + n elements are: the first n elements are the first 
monomials of the polynomials of the system, the second n elements are the derivatives of the 
first monomials with respect to x±, the third n elements are the derivatives of the first monomials 
with respect to X2, and so on until the (n + l)th n elements, which are the derivatives of the first 
monomials with respect to x n . The second n 2 + n elements represent the second terms in each 
of n 2 + n summations, and again the first n elements of them represent the second monomials of 
the polynomials of the system, and the next n 2 elements represent the partial derivatives of the 
second monomials of the system, listed in the same order as are listed the derivatives of the first 
monomials. In general the jth n 2 + n elements represent jth. monomials of the polynomials of 
the system and their partial derivatives listed in the same order as listed the first monomials of 
the system and their partial derivatives at the first n 2 + n elements of the array. 
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For simplicity in this description we assumed that the number B of threads in a block, the block 
size, divides n 2 +n. Now if we launch (n 2 +n) / 'B blocks, with a thread t = Blockldx B+Threadld 
computing the sum: 

EJlo 1 MoNs[t + j(n 2 +n)}, the obtained sums will represent the values of 
polynomials of the system and of the Jacobian, while access to the elements of Mons will be 
coalesced within warps at each step j = 0, 1, . . . ,m — 1 of the summation. To create the array 
Mons in such a format, we had to make the threads of the second kernel to output the values of 
monomials and their derivatives not in a coalesced way. However there was a tradeoff: 

• either to make the output of the second kernel coalesced and then the input of the third 
kernel could not be accessed in a coalesced way, 

• or as we chosen to provide ability for the threads of the third kernel to read the input data 
in a coalesced way, in a price of not coalesced writing of the output of the second kernel. 

4 Computational Experiments 

Our computations are done on a HP Z800 workstation, running Red Hat Enterprise Linux Work- 
station release 6.1. The CPU is an Intel Xeon X5690 at 3.47 Ghz. The processor clock of the 
NVIDIA Tesla C2050 Computing Processor runs at 1147 Mhz. The graphics card has 14 multi- 
processors, each with 32 cores, for a total of 448 cores. As the clock speed of the GPU is a third 
of the clock speed of the CPU, we hope to achieve a double digit speedup. We used the NVIDIA 
CUDA compiler driver nvcc, release 4.0, VO.2.1221. 

In Table Q] and [2] we list results of our preliminary implementation. The number of threads 
in each block was 32 for all three kernels to evaluate a system and its Jacobian matrix of dimen- 
sion 32. Generating 32 monomials per polynomial leads to 1,024 monomial in total. 



^monomials 


Tesla C2050 


1 CPU core 


speedup 


704 


14.514 sec 


lmin 50.9 sec 


7.60 


1024 


15.265 sec 


2min 39.3 sec 


10.44 


1536 


17.000 sec 


3min 58.7 sec 


14.04 



Table 1: Wall clock times and speedups for 100,000 evaluations of a polynomial system and its 
Jacobian matrix of dimension 32. Each monomial has 9 variables occurring with nonzero power 
of at most 2. 



^monomials 


Tesla C2050 


1 CPU core 


speedup 


704 


19.068 sec 


3min 16.9 sec 


10.33 


1024 


20.800 sec 


4min 43.3 sec 


13.62 


1536 


21.763 sec 


7min 05.8 sec 


19.56 



Table 2: Wall clock times and speedups for 100,000 evaluations of a polynomial system and its 
Jacobian matrix of dimension 32. Each monomial has 16 variables occurring with nonzero power 
of at most 10. 
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Increasing the number of monomials to 2,048 in Table [T] and [2] would have yielded a speedup 
of more than 20, but the capacity of the constant memory was not sufficient to hold the expo- 
nents and positions of all 2,048 monomials. For larger systems, we will upgrade our preliminary 
implementation with a better compression strategy (instead of the current char used for each 
exponent). 

5 Conclusions 

Starting from an algorithm for evaluation and differentiation that is already close to optimal, we 
obtained good speedups on a graphics computing processor for randomly generated polynomial 
systems of dimension 32 (the warp size) and fixed number of monomials per polynomial in the 
system. 
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